Measuring Societal Biases in Text Corpora via First-Order Co-occurrence
Text corpora are used to study societal biases, typically through statistical
models such as word embeddings. The bias of a word towards a concept is
usually estimated using vector similarity, measuring whether the word and
the concept words share other words in their contexts. We argue that this
second-order relationship introduces unrelated concepts into the measure, which
causes an imprecise measurement of the bias. We propose instead to measure bias
using the direct normalized co-occurrence associations between the word and the
representative concept words, a first-order measure, by reconstructing the
co-occurrence estimates inherent in the word embedding models. To study our
novel corpus bias measurement method, we calculate the correlation of the
gender bias values estimated from text with the actual gender bias statistics
of the U.S. job market, provided by two recent collections. The results show a
consistently higher correlation when using the proposed first-order measure
with a variety of word embedding models, as well as a more severe degree of
bias, especially toward females in a few specific occupations.
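The paper's method reconstructs co-occurrence estimates that are implicit in word embedding models; as an illustration of the underlying idea only, a first-order association can also be computed directly from corpus counts, e.g., with positive PMI as the normalized co-occurrence measure. The sketch below is a minimal version under those assumptions (tokenized sentences as input, PPMI, illustrative function names), not the paper's exact formulation.

```python
import math
from collections import Counter

def cooccurrence_counts(sentences, window=5):
    """Symmetric within-window co-occurrence counts and unigram counts."""
    co, uni = Counter(), Counter()
    for tokens in sentences:
        uni.update(tokens)
        for i, w in enumerate(tokens):
            for c in tokens[i + 1 : i + 1 + window]:
                co[(w, c)] += 1
                co[(c, w)] += 1
    return co, uni

def ppmi(w, c, co, uni, total):
    """Simplified positive PMI estimate from raw counts."""
    if co[(w, c)] == 0 or uni[w] == 0 or uni[c] == 0:
        return 0.0
    return max(0.0, math.log(co[(w, c)] * total / (uni[w] * uni[c])))

def first_order_bias(word, female_terms, male_terms, co, uni):
    """Bias of `word` toward the female vs. the male concept: the difference
    of its mean first-order association with each set of concept words."""
    total = sum(uni.values())
    f = sum(ppmi(word, t, co, uni, total) for t in female_terms) / len(female_terms)
    m = sum(ppmi(word, t, co, uni, total) for t in male_terms) / len(male_terms)
    return f - m  # > 0: associated more with the female concept
```

For example, `first_order_bias("nurse", ["she", "her", "woman"], ["he", "his", "man"], co, uni)` would return a positive value if "nurse" co-occurs disproportionately with the female concept words.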
Enhancing the Ranking Context of Dense Retrieval Methods through Reciprocal Nearest Neighbors
Sparse annotation poses persistent challenges to training dense retrieval
models; for example, it distorts the training signal when unlabeled relevant
documents are used spuriously as negatives in contrastive learning. To
alleviate this problem, we introduce evidence-based label smoothing, a novel,
computationally efficient method that prevents penalizing the model for
assigning high relevance to false negatives. To compute the target relevance
distribution over candidate documents within the ranking context of a given
query, we assign a non-zero relevance probability to the candidates most
similar to the ground truth, scaled by the degree of their similarity to the
ground-truth document(s).
To estimate relevance, we leverage an improved similarity metric based on
reciprocal nearest neighbors, which can also be used independently to rerank
candidates in post-processing. Through extensive experiments on two large-scale
ad hoc text retrieval datasets, we demonstrate that reciprocal nearest
neighbors can improve the ranking effectiveness of dense retrieval models, both
when used for label smoothing and when used for reranking. This indicates that by
considering relationships between documents and queries beyond simple geometric
distance, we can effectively enhance the ranking context.
Comment: EMNLP 202
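The abstract does not spell out the metric itself; one established formulation of reciprocal-nearest-neighbor similarity (known from reranking work in image retrieval) is the Jaccard overlap of k-reciprocal neighbor sets, computed over a pool that stacks the query with its candidate embeddings. A minimal sketch under that assumption, with illustrative names throughout:

```python
import numpy as np

def k_reciprocal_sets(sim, k=20):
    """k-reciprocal neighbor set of every item: j is kept as a neighbor of i
    only if i is also among the top-k neighbors of j."""
    nn = np.argsort(-sim, axis=1)[:, : k + 1]  # +1 since each row ranks itself first
    top = [set(row) for row in nn]
    return [{j for j in top[i] if i in top[j]} for i in range(sim.shape[0])]

def reciprocal_nn_similarity(sets):
    """Refined similarity: Jaccard overlap of the k-reciprocal neighbor sets."""
    n = len(sets)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = len(sets[i] | sets[j])
            out[i, j] = len(sets[i] & sets[j]) / union if union else 0.0
    return out

# Usage: with `embs` stacking the unit-normalized query and candidate embeddings,
# sim = embs @ embs.T; row 0 of reciprocal_nn_similarity(k_reciprocal_sets(sim))
# then reranks the candidates for the query.
```

A refined similarity of this kind can serve both roles the abstract mentions: reranking candidates directly in post-processing, or selecting which candidates receive non-zero probability mass when smoothing the target relevance distribution.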
WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models
Large pretrained language models (LMs) have become the central building block
of many NLP applications. Training these models requires ever more
computational resources and most of the existing models are trained on English
text only. It is exceedingly expensive to train these models in other
languages. To alleviate this problem, we introduce a novel method -- called
WECHSEL -- to efficiently and effectively transfer pretrained LMs to new
languages. WECHSEL can be applied to any model which uses subword-based
tokenization and learns an embedding for each subword. The tokenizer of the
source model (in English) is replaced with a tokenizer in the target language
and token embeddings are initialized such that they are semantically similar to
the English tokens by utilizing multilingual static word embeddings covering
English and the target language. We use WECHSEL to transfer the English RoBERTa
and GPT-2 models to four languages (French, German, Chinese and Swahili). We
also study the benefits of our method on very low-resource languages. WECHSEL
improves over previously proposed methods for cross-lingual parameter transfer and
outperforms models of comparable size trained from scratch with up to 64x less
training effort. Our method makes training large language models for new
languages more accessible and less damaging to the environment. We make our
code and models publicly available.
Comment: NAACL 202
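Concretely, the initialization amounts to mapping each target-language subword to a convex combination of source-language subword embeddings, weighted by similarity in the shared static space. The sketch below is a simplified rendering: it assumes subword-level static embeddings for both tokenizers are already aligned in one space (WECHSEL derives these from aligned word-level embeddings), and `k` and `temp` are illustrative hyperparameters, not the paper's values.

```python
import numpy as np

def init_target_embeddings(src_emb, src_static, tgt_static, k=10, temp=0.1):
    """Initialize target-language token embeddings from the source model's.

    src_emb:    (V_src, d_model) pretrained input embeddings of the source LM
    src_static: (V_src, d_static) static embeddings of the source subwords
    tgt_static: (V_tgt, d_static) static embeddings of the target subwords,
                aligned to the same space as src_static
    """
    s = tgt_static / np.linalg.norm(tgt_static, axis=1, keepdims=True)
    t = src_static / np.linalg.norm(src_static, axis=1, keepdims=True)
    sim = s @ t.T                                # (V_tgt, V_src) cosine similarities

    tgt_emb = np.empty((tgt_static.shape[0], src_emb.shape[1]))
    for i in range(sim.shape[0]):
        nn = np.argsort(-sim[i])[:k]             # k most similar source subwords
        w = np.exp(sim[i, nn] / temp)
        w /= w.sum()                             # softmax weights over the neighbors
        tgt_emb[i] = w @ src_emb[nn]             # convex combination of their embeddings
    return tgt_emb
```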
Volatility Prediction using Financial Disclosures Sentiments with Word Embedding-based IR Models
Volatility prediction--an essential concept in financial markets--has
recently been addressed using sentiment analysis methods. We investigate the
sentiment of annual disclosures of companies in stock markets to forecast
volatility. We specifically explore the use of recent Information Retrieval
(IR) term weighting models that are effectively extended by related terms using
word embeddings. In parallel to textual information, factual market data have
been widely used as the mainstream approach to forecast market risk. We
therefore study different fusion methods to combine text and market data
resources. Our word embedding-based approach significantly outperforms
state-of-the-art methods. In addition, we investigate the characteristics of
the reports of companies in different financial sectors.
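The abstract leaves the exact weighting models to the paper; the general expansion idea, however, is simple: a term's weight in a document is extended by occurrences of embedding-related terms, discounted by their similarity. The sketch below shows this on plain term frequency with a cosine threshold; both the threshold and the plain-TF setting are illustrative simplifications of the full IR term weighting models actually used.

```python
def extended_tf(term, doc_tf, vectors, threshold=0.7):
    """Extended term frequency: occurrences of embedding-related terms
    contribute to `term`, weighted by cosine similarity.

    doc_tf:  dict mapping each document term to its raw count
    vectors: dict mapping vocabulary terms to unit-normalized NumPy vectors
    """
    v = vectors[term]  # assumes `term` is in the embedding vocabulary
    tf = 0.0
    for t, count in doc_tf.items():
        if t == term:
            tf += count
        elif t in vectors:
            sim = float(v @ vectors[t])
            if sim >= threshold:
                tf += sim * count  # related term contributes fractionally
    return tf
```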
Fairness of recommender systems in the recruitment domain: an analysis from technical and legal perspectives
Recommender systems (RSs) have become an integral part of the hiring process, be it via job advertisement ranking systems (job recommenders) for the potential employee or candidate ranking systems (candidate recommenders) for the employer. As seen in other domains, RSs are prone to harmful biases, unfair algorithmic behavior, and even discrimination in a legal sense. Some cases, such as salary equity with regard to gender (the gender pay gap), stereotypical job perceptions along gendered lines, or biases toward other subgroups sharing specific characteristics in candidate recommenders, can have profound ethical and legal implications. In this survey, we discuss the current state of fairness research considering the fairness definitions (e.g., demographic parity and equal opportunity) used in recruitment-related RSs (RRSs). We investigate from a technical perspective the approaches to improve fairness, like synthetic data generation, adversarial training, protected subgroup distributional constraints, and post-hoc re-ranking. Thereafter, from a legal perspective, we first contrast the fairness definitions and the effects of the aforementioned approaches with existing EU and US law requirements for employment and occupation, and second, we ascertain whether and to what extent EU and US law permits such approaches to improve fairness. We finally discuss the advances that RSs have made in terms of fairness in the recruitment domain, compare them with those made in other domains, and outline existing open challenges.
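Of the technical interventions listed, post-hoc re-ranking is the most self-contained to illustrate. The sketch below is a generic greedy re-ranker that nudges every ranking prefix toward given group proportions (a demographic-parity-style constraint); it is a minimal illustration with hypothetical names, not the method of any specific system the survey covers.

```python
def fair_rerank(candidates, target_share, group_of):
    """Greedy post-hoc re-ranking toward demographic parity.

    candidates:   candidate ids sorted by relevance, best first
    target_share: dict mapping each group to its desired proportion at
                  every ranking prefix (assumed to cover all groups)
    group_of:     function mapping a candidate id to its group label
    """
    remaining = list(candidates)
    ranking, counts = [], {g: 0 for g in target_share}
    while remaining:
        k = len(ranking) + 1
        # groups currently below their target share in the next prefix
        under = [g for g in target_share if counts[g] < target_share[g] * k]
        # best remaining candidate from an under-represented group, else best overall
        pick = next((c for c in remaining if group_of(c) in under), remaining[0])
        remaining.remove(pick)
        counts[group_of(pick)] += 1
        ranking.append(pick)
    return ranking
```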
CODER: An efficient framework for improving retrieval through COntextual Document Embedding Reranking
Contrastive learning has been the dominant approach to training dense
retrieval models. In this work, we investigate the impact of ranking context -
an often overlooked aspect of learning dense retrieval models. In particular,
we examine the effect of its constituent parts: jointly scoring a large number
of negatives per query, using retrieved (query-specific) instead of random
negatives, and a fully list-wise loss. To incorporate these factors into
training, we introduce Contextual Document Embedding Reranking (CODER), a
highly efficient retrieval framework. When reranking, it incurs only a
negligible computational overhead on top of a first-stage method at run time
(a per-query delay on the order of milliseconds), allowing it to be easily
combined with any state-of-the-art dual encoder method. After fine-tuning
through CODER, which is a lightweight and fast process, models can also be used
as stand-alone retrievers. Evaluating CODER in a large set of experiments on
the MS MARCO and TripClick collections, we show that the contextual reranking
of precomputed document embeddings leads to a significant improvement in
retrieval performance. This improvement becomes even more pronounced when more
relevance information per query is available, shown in the TripClick
collection, where we establish new state-of-the-art results by a large margin.
Comment: EMNLP 202
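The list-wise part of such a training objective is compact to state. The sketch below (PyTorch, all names illustrative) scores N precomputed candidate embeddings against one query and takes the cross-entropy between the softmax over the whole candidate list and a target relevance distribution; it conveys the shape of a fully list-wise loss over a query's ranking context, not CODER's exact objective.

```python
import torch
import torch.nn.functional as F

def listwise_loss(query_emb, cand_embs, target_relevance):
    """List-wise loss over one query's ranking context.

    query_emb:        (d,) query embedding from the dual encoder being fine-tuned
    cand_embs:        (N, d) precomputed embeddings of the N retrieved candidates
    target_relevance: (N,) non-negative targets, e.g. 1.0 for labeled positives
                      and 0.0 otherwise, or a smoothed distribution
    """
    scores = cand_embs @ query_emb            # (N,) dot-product relevance scores
    log_probs = F.log_softmax(scores, dim=0)  # distribution over the candidate list
    target = target_relevance / target_relevance.sum()
    return -(target * log_probs).sum()        # cross-entropy against the target
```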